We consider the problem of approximating a given element $f$ from
a Hilbert space $\mathcal{H}$ by means of greedy algorithms and the application
of such procedures to the regression problem in statistical learning 
theory. We improve on the existing theory of convergence rates for 
both the orthogonal greedy algorithm and the relaxed greedy algo- 
rithm, as well as for the forward stepwise projection algorithm. For all 
these algorithms, we prove convergence results for a variety of func- 
tion classes and not simply those that are related to the convex hull 
of the dictionary. We then show how these bounds for convergence 
rates lead to a new theory for the performance of greedy algorithms in 
learning. In particular, we build upon the results in [IEEE Trans. In- 
form. Theory 42 (1996) 2118-2132] to construct learning algorithms 
based on greedy approximations which are universally consistent and 
provide provable convergence rates for large classes of functions. The 
use of greedy algorithms in the context of learning is very appealing 
since it greatly reduces the computational burden when compared 
with standard model selection using general dictionaries. 

1. Introduction. We consider the problem of approximating a function
$f$ from a Hilbert space $\mathcal{H}$ by a finite linear combination $\hat f$ of elements of
a given dictionary $\mathcal{D} = (g)_{g \in \mathcal{D}}$. Here, by dictionary, we mean any family of
functions from $\mathcal{H}$. In this paper, this problem is addressed in two different
contexts:

(i) Deterministic approximation: $f$ is a known function in a Hilbert
space $\mathcal{H}$. The approximation error is naturally measured by $\|f - \hat f\|$, where
$\|\cdot\|$ is the corresponding norm generated by the inner product $\langle \cdot, \cdot \rangle$ on $\mathcal{H}$,
that is, $\|g\|^2 := \|g\|_{\mathcal{H}}^2 := \langle g, g \rangle$.

(ii) Statistical learning: $f = f_\rho$, where $f_\rho(x) = E(y|x)$ is the regression
function of an unknown distribution $\rho$ on $X \times Y$, with $x \in X$ and $y \in Y$
respectively representing the feature and output variables, from which we
observe independent realizations $(z_i) = (x_i, y_i)$ for $i = 1, \dots, n$. The
approximation error is now measured in the Hilbertian norm $\|u\|^2 := E(|u(x)|^2)$.

In either of these situations, we may introduce the set $\Sigma_N$ of all possible
linear combinations of elements of $\mathcal{D}$ with at most $N$ terms and define the
best $N$-term approximation error $\sigma_N(f)$ as the infimum of $\|f - \hat f\|$ over all
$\hat f$ of this type,

(1.1)    $\sigma_N(f) := \inf_{\Lambda \subset \mathcal{D},\ \#(\Lambda) \le N}\ \inf_{(c_g)} \Big\| f - \sum_{g \in \Lambda} c_g g \Big\|.$

In the case where $\mathcal{D}$ is an orthonormal basis, the minimum is attained by

(1.2)    $\hat f = \sum_{g \in \Lambda_N(f)} c_g g,$

where $\Lambda_N(f)$ corresponds to the coordinates $c_g := \langle f, g \rangle$ which are the $N$
largest in absolute value. The approximation properties of this process are
well understood; see, for example, [8]. In particular, one can easily check that
the convergence rate $\|f - \hat f\|_{\mathcal{H}} \lesssim N^{-s}$ is equivalent to the property that the
sequence $(c_g)_{g \in \mathcal{D}}$ belongs to the weak space $w\ell_p$ with $1/p = 1/2 + s$. We recall
here that $w\ell_p$ is the space of sequences $(c_g)_{g \in \mathcal{D}}$ such that the quasi-norm
$\|(c_g)\|_{w\ell_p}$ defined by

(1.3)    $\|(c_g)\|_{w\ell_p}^p := \sup_{\eta > 0} \eta^p\, \#(\{g : |c_g| \ge \eta\})$

is finite. Note that $\|(c_g)\|_{w\ell_p}^p \le \sum_{g \in \mathcal{D}} |c_g|^p := \|(c_g)\|_{\ell_p}^p$, so that $\ell_p \subset w\ell_p$ (see,
e.g., the survey [8] or standard books on functional analysis for a more
expanded discussion of $\ell_p$ and weak $\ell_p$ spaces). Here, and later in this paper,
we use the notation $A \lesssim B$ to mean that $A \le CB$ for some absolute constant
$C$ that does not depend on the parameters which define $A$ and $B$.
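As a minimal sketch of the orthonormal case, assuming the coefficients $c_g$ are available as a finite Python list, the selection rule (1.2) and the quantity (1.3) can be computed as follows. The function names `best_n_term`, `n_term_error` and `weak_lp_quasinorm_p` are illustrative and do not come from the paper. For a finite sequence, the supremum in (1.3) is attained at one of the values $\eta = |c_g|$, so it reduces to a maximum over the decreasing rearrangement of the coefficients.

```python
def best_n_term(coeffs, n):
    """Keep the n coefficients of largest absolute value, as in (1.2)."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:n])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]


def n_term_error(coeffs, n):
    """Error of the best n-term approximation in an orthonormal basis."""
    approx = best_n_term(coeffs, n)
    return sum((c - a) ** 2 for c, a in zip(coeffs, approx)) ** 0.5


def weak_lp_quasinorm_p(coeffs, p):
    """p-th power of the weak l_p quasi-norm (1.3) of a finite sequence."""
    decreasing = sorted((abs(c) for c in coeffs), reverse=True)
    # With k coefficients of size at least eta, eta^p * k is largest at
    # eta equal to the k-th largest coefficient.
    return max((k * c ** p for k, c in enumerate(decreasing, start=1)), default=0.0)


coeffs = [3.0, -1.0, 2.0, 0.5]
print(best_n_term(coeffs, 2))             # keeps the entries 3.0 and 2.0
print(n_term_error(coeffs, 2))            # l_2 norm of the discarded entries
print(weak_lp_quasinorm_p(coeffs, 1.0))   # bounded by sum(|c_g|) = 6.5
```

The last line illustrates the inequality $\|(c_g)\|_{w\ell_p}^p \le \|(c_g)\|_{\ell_p}^p$ for $p = 1$.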

One of the motivations for utilizing general dictionaries rather than
orthonormal systems is that in many applications such as signal processing and
statistical estimation, it is not clear which orthonormal system, if any, is best
for representing or approximating $f$. Thus, dictionaries which are unions of
several bases or collections of general waveforms are preferred. Some
well-known examples are the use of Gabor systems, curvelets and wavepackets in
signal processing and neural networks in learning theory. Moreover, in
statistical learning, orthonormality is meaningful in the norm $\|u\|^2 := E(|u(x)|^2)$,
which is usually out of reach since it depends on the unknown distribution
of the random variable $x$. Therefore, non-orthonormal systems cannot be
avoided in this context.

When working with dictionaries $\mathcal{D}$ which are not orthonormal bases, the
realization of a best $N$-term approximation is usually out of reach from
a computational point of view since it would require minimizing $\|f - \hat f\|$
over all $\hat f$ in an infinite or huge number of $N$-dimensional subspaces. Greedy
algorithms or matching pursuits aim to build "suboptimal yet good" $N$-term
approximations through a greedy selection of elements $g_k$, $k = 1, 2, \dots$,
within the dictionary $\mathcal{D}$ and to do so with a more manageable number of
computations.

1.1. Greedy algorithms. Greedy algorithms have been introduced in the 
context of statistical estimation. They have also been considered for appli- 
cations in signal processing [1]. Their approximation properties have been 
explored in [4, 9, 14, 15, 18, 19] for general bounded dictionaries along with 
various applications. A recent survey of the approximation properties of such 
algorithms is given in [21]. 

There exist several versions of these algorithms. The four most commonly 
used are the pure greedy, the orthogonal greedy, the relaxed greedy and 
the stepwise projection algorithms, which we respectively denote by the 
acronyms PGA, OGA, RGA and SPA. We describe these algorithms in the 
deterministic setting. We shall assume here and later that the elements of
the dictionary are normalized according to $\|g\| = 1$ for all $g \in \mathcal{D}$ unless it is
explicitly stated otherwise.

All four of these algorithms begin by setting $f_0 := 0$. We then recursively
define the approximant $f_k$ based on $f_{k-1}$ and its residual $r_{k-1} := f - f_{k-1}$.

In the PGA and the OGA, we select a member of the dictionary as

(1.4)    $g_k := \operatorname{Argmax}_{g \in \mathcal{D}} |\langle r_{k-1}, g \rangle|.$

The new approximation is then defined as

(1.5)    $f_k := f_{k-1} + \langle r_{k-1}, g_k \rangle g_k$

in the PGA and as

(1.6)    $f_k = P_k f$

in the OGA, where $P_k$ is the orthogonal projection onto $V_k := \operatorname{Span}\{g_1, \dots, g_k\}$.
It should be noted that when $\mathcal{D}$ is an orthonormal basis, both algorithms
coincide with the computation of the best $k$-term approximation.

In the RGA, the new approximation is defined as

(1.7)    $f_k = \alpha_k f_{k-1} + \beta_k g_k,$

where $(\alpha_k, \beta_k)$ are real numbers and $g_k$ is a member of the dictionary. There
exist many possibilities for the choice of $(\alpha_k, \beta_k, g_k)$, the most greedy being
to select them according to

(1.8)    $(\alpha_k, \beta_k, g_k) := \operatorname{Argmin}_{(\alpha, \beta, g) \in \mathbb{R}^2 \times \mathcal{D}} \|f - \alpha f_{k-1} - \beta g\|.$

Other choices specify one or several of these parameters, for example, by
taking $g_k$ as in (1.4) or by setting in advance the values of $\alpha_k$ and $\beta_k$; see,
for example, [14] and [4]. Note that the RGA coincides with the PGA when
the parameter $\alpha_k$ is set to 1.

In the SPA, the approximation $f_k$ is defined by (1.6), as in the OGA, but
the choice of $g_k$ is made so as to minimize over all $g \in \mathcal{D}$ the error between
$f$ and its orthogonal projection onto $\operatorname{Span}\{g_1, \dots, g_{k-1}, g\}$.
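The PGA and OGA updates (1.4)-(1.6) can be sketched as follows, under the simplifying assumptions that $\mathcal{H} = \mathbb{R}^d$ with the Euclidean inner product and that the dictionary is a finite list of unit vectors. The helper names (`dot`, `add_scaled`, `select`, `pga`, `oga`) are hypothetical, and the OGA projection is realized here by incremental Gram-Schmidt orthogonalization of the selected elements, which is one possible implementation among several.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def add_scaled(y, a, x):
    """Return the vector y + a * x."""
    return [yi + a * xi for yi, xi in zip(y, x)]


def select(residual, dictionary):
    """Greedy selection (1.4): maximize |<r_{k-1}, g>| over the dictionary."""
    return max(dictionary, key=lambda g: abs(dot(residual, g)))


def pga(f, dictionary, steps):
    """Pure greedy algorithm, update (1.5)."""
    approx, residual = [0.0] * len(f), list(f)
    for _ in range(steps):
        g = select(residual, dictionary)
        c = dot(residual, g)
        approx = add_scaled(approx, c, g)
        residual = add_scaled(residual, -c, g)
    return approx


def oga(f, dictionary, steps):
    """Orthogonal greedy algorithm, update (1.6): f_k = P_k f."""
    basis, residual = [], list(f)   # basis: orthonormal basis of V_k
    for _ in range(steps):
        q = list(select(residual, dictionary))
        for b in basis:                       # orthogonalize against V_{k-1}
            q = add_scaled(q, -dot(q, b), b)
        norm = dot(q, q) ** 0.5
        if norm > 1e-12:                      # skip elements already in V_{k-1}
            q = [x / norm for x in q]
            basis.append(q)
            residual = add_scaled(residual, -dot(residual, q), q)
    return [fi - ri for fi, ri in zip(f, residual)]


dictionary = [[1.0, 0.0], [0.6, 0.8]]
f = [1.0, 1.0]
print(pga(f, dictionary, 2))   # still has a nonzero residual after two steps
print(oga(f, dictionary, 2))   # recovers f, since the two elements span R^2
```

On this two-element, non-orthogonal dictionary, the OGA reproduces $f$ after two steps while the PGA does not, which illustrates the difference between (1.5) and (1.6).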

Note that from a computational point of view, the OGA and SPA are more
expensive to implement since at each step, they require the evaluation of the
orthogonal projection $P_k f$ (and, in the case of SPA, a renormalization). Such
projection updates are computed preferably using Gram-Schmidt
orthogonalization (e.g., via a QR factorization) or by solving the normal equations

(1.9)    $G_k a_k = b_k,$

where $G_k := (\langle g_i, g_j \rangle)_{i,j=1,\dots,k}$ is the Gramian matrix, $b_k := (\langle f, g_i \rangle)_{i=1,\dots,k}$
and $a_k := (a_j)_{j=1,\dots,k}$ is the vector such that $f_k = \sum_{j=1}^{k} a_j g_j$.
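For illustration, the normal equations (1.9) can be assembled and solved directly when the selected elements are linearly independent. The sketch below assumes vectors in $\mathbb{R}^d$ and uses a small hand-written Gaussian elimination; the names `solve` and `projection_coefficients` are ours, and in practice a library linear solver or an updated QR factorization would normally be preferred.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(rhs)
    rows = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    solution = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(rows[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (rows[r][n] - tail) / rows[r][r]
    return solution


def projection_coefficients(f, atoms):
    """Solve G_k a_k = b_k so that sum_j a_j g_j is the projection of f."""
    gram = [[dot(gi, gj) for gj in atoms] for gi in atoms]   # G_k
    b = [dot(f, gi) for gi in atoms]                          # b_k
    return solve(gram, b)


atoms = [[0.6, 0.8], [1.0, 0.0]]
a = projection_coefficients([1.0, 1.0], atoms)
print(a)   # coefficients a_1, a_2 with a_1 * g_1 + a_2 * g_2 = f
```

The Gramian becomes ill-conditioned when the selected elements are nearly linearly dependent, which is one reason why orthogonalization is often the more stable choice.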

In order to describe the known results concerning the approximation
properties of these algorithms, we introduce the class $\mathcal{L}_1 := \mathcal{L}_1(\mathcal{D})$ consisting of
those functions $f$ which admit an expansion $f = \sum_{g \in \mathcal{D}} c_g g$, where the
coefficient sequence $(c_g)$ is absolutely summable. We define the norm

(1.10)    $\|f\|_{\mathcal{L}_1} := \inf \Big\{ \sum_{g \in \mathcal{D}} |c_g| : f = \sum_{g \in \mathcal{D}} c_g g \Big\}$

for this space. This norm may be thought of as an $\ell_1$ norm on the coefficients
in the representation of the function $f$ by elements of the dictionary; we
emphasize that it is not to be confused with the $L_1$ norm of $f$. An alternate,
and closely related, way of defining the $\mathcal{L}_1$ norm is by the infimum of numbers
$V$ for which $f/V$ is in the closure of the convex hull of $\mathcal{D} \cup (-\mathcal{D})$. This is
known as the "variation" of $f$ with respect to $\mathcal{D}$ and was used in [16, 17],
building on the earlier terminology in [3].

In the case where $\mathcal{D}$ is an orthonormal basis, we find that if $f \in \mathcal{L}_1$, then

(1.11)    $\sigma_N(f) = \Big( \sum_{g \notin \Lambda_N(f)} |c_g|^2 \Big)^{1/2} \le \Big( \|f\|_{\mathcal{L}_1} \min_{g \in \Lambda_N(f)} |c_g| \Big)^{1/2} \le \|f\|_{\mathcal{L}_1} N^{-1/2}.$

For the PGA, it was proved in [9] that $f \in \mathcal{L}_1$ implies that

(1.12)    $\|f - f_N\| \lesssim N^{-1/6}.$

This rate was improved to $N^{-11/62}$ in [15], but, on the other hand, it was
shown in [19] that for a particular dictionary, there exists $f \in \mathcal{L}_1$ such that

(1.13)    $\|f - f_N\| \gtrsim N^{-0.27}.$

When compared with (1.11), we see that the PGA is far from being optimal.

The RGA, OGA and SPA behave somewhat better: it was proven in [14]
for the RGA and SPA and in [9] for the OGA that one has

(1.14)    $\|f - f_N\| \lesssim \|f\|_{\mathcal{L}_1} N^{-1/2}$

for all $f \in \mathcal{L}_1$.

For each of these algorithms, it is known that the convergence rate $N^{-1/2}$
cannot generally be improved, even for functions which admit a very sparse
expansion in the dictionary $\mathcal{D}$ (see [9] for such a result with a function being
the sum of two elements of $\mathcal{D}$).

At this point, some remarks are in order regarding the meaning of the
condition $f \in \mathcal{L}_1$ for some concrete dictionaries. A commonly made
statement is that greedy algorithms break the curse of dimensionality, in that
the rate $N^{-1/2}$ is independent of the dimension $d$ of the variable space for
$f$ and only relies on the assumption that $f \in \mathcal{L}_1$. This is not exactly true
since, in practice, the condition that $f \in \mathcal{L}_1$ becomes more and more
stringent as $d$ grows. For instance, in the case where we work in the Hilbert space
$\mathcal{H} := L_2([0,1]^d)$ and where $\mathcal{D}$ is a wavelet basis $(\psi_\lambda)$, it is known that the
smoothness property which ensures that $f \in \mathcal{L}_1$ is that $f$ should belong to
the Besov space $B^s_1(L_1)$ with $s = d/2$, which roughly means that $f$ has all
of its derivatives of order less than or equal to $d/2$ in $L_1$ (see [8] for the
characterization of Besov spaces by the properties of wavelet coefficients).
Moreover, for this to hold, it is required that the dual wavelets have
at least $d/2 - 1$ vanishing moments. Another instance is the case where $\mathcal{D}$
consists of sigmoidal functions of the type $\sigma(v \cdot x - w)$, where $\sigma$ is a fixed
univariate function, $v$ is an arbitrary vector in $\mathbb{R}^d$, and $w$ is an arbitrary real
number. For such dictionaries, it was proved in [4] that a sufficient condition
to have $f \in \mathcal{L}_1$ is the convergence of $\int |\omega| |\mathcal{F}f(\omega)| \, d\omega$, where $\mathcal{F}$ is the Fourier
operator. This integrability condition requires a larger amount of decay on
the Fourier transform $\mathcal{F}f$ as $d$ grows. Assuming that $f \in \mathcal{L}_1$ is therefore
more and more restrictive as $d$ grows. Similar remarks also hold for other
dictionaries (hyperbolic wavelets, Gabor functions, etc.).

1.2. Results of this paper. The discussion of the previous section points
to a significant weakness in the present theory of greedy algorithms, in that
there are no viable bounds for the performance of greedy algorithms for
general functions $f \in \mathcal{H}$. This is a severe impediment in some application
domains (such as learning theory) where there is no a priori knowledge
indicating that the target function is in $\mathcal{L}_1$. One of the main contributions of
the present paper is to provide error bounds for the performance of greedy
algorithms for general functions $f \in \mathcal{H}$. We shall focus our attention on
the OGA and RGA, which, as explained above, have better convergence
properties in $\mathcal{L}_1$ than the PGA. We shall consider the specific version of the
RGA in which $\alpha_k$ is fixed at $(1 - \lambda/k)_+$, for some fixed $\lambda \ge 1$, and then
$(\beta_k, g_k)$ are optimized.

Inspection of the proofs in our paper shows that all further approximation
results proved for this version of the RGA also hold for any greedy algorithm
such that

(1.15)    $\|f - f_k\| \le \min_{\beta, g} \|f - \alpha_k f_{k-1} - \beta g\|,$

irrespective of how $f_k$ is defined. In particular, they hold for the more general
version of the RGA defined by (1.8), as well as for the SPA.

In Section 2, we introduce both algorithms and recall the optimal
approximation rate $N^{-1/2}$ when the target function $f$ is in $\mathcal{L}_1$. Later in this
section, we develop a technique based on the interpolation of operators that
provides convergence rates $N^{-s}$, $0 < s < 1/2$, whenever $f$ belongs to a
certain intermediate space between $\mathcal{L}_1$ and the Hilbert space $\mathcal{H}$. Namely, we
use the spaces

(1.16)    $B_p := [\mathcal{H}, \mathcal{L}_1]_{\theta, \infty}, \quad \theta := 2/p - 1, \quad 1 < p < 2,$

which are the real interpolation spaces between $\mathcal{H}$ and $\mathcal{L}_1$. We show that if
$f \in B_p$, then the OGA and RGA, when applied to $f$, provide approximation
rates $CN^{-s}$ with $s := \theta/2 = 1/p - 1/2$. Thus, if we set $B_1 = \mathcal{L}_1$, then these
spaces provide a full range of approximation rates for greedy algorithms.
Recall, as discussed previously, that for general dictionaries, greedy algorithms
will not provide convergence rates better than $N^{-1/2}$ for even the simplest
of functions. The results we obtain are optimal in the sense that we recover
the best possible convergence rate in the case where the dictionary is an
orthonormal basis. For an arbitrary target function $f \in \mathcal{H}$, convergence of
the OGA and RGA holds without rate. Finally, we conclude the section by
discussing several issues related to the numerical implementation of these
greedy algorithms. In particular, we consider the effect of reducing the
dictionary $\mathcal{D}$ to a finite sub-dictionary.

In Section 3, we consider the learning problem under the assumption
that the data $y := (y_1, \dots, y_n)$ are bounded in absolute value by some fixed
constant $B$. Our estimator is built on the application of the OGA or RGA
to the noisy data $y$ in the Hilbert space defined by the empirical norm

(1.17)    $\|f\|_n^2 := \frac{1}{n} \sum_{i=1}^{n} |f(x_i)|^2$

and its associated inner product. At each step $k$, the algorithm generates an
approximation $f_k$ to the data. Our estimator is defined by

(1.18)    $\hat f := T f_{k^*},$

where

(1.19)    $T x := T_B x := \min\{B, |x|\} \operatorname{sgn}(x)$

is the truncation operator at level $B$ and the value of $k^*$ is selected by a
complexity regularization procedure. The main result for this estimator is
(roughly) that when the regression function $f_\rho$ is in $B_p$ [where this space
is defined with respect to the norm $\|u\|^2 := E(|u(x)|^2)$], the estimator has
convergence rate

(1.20)    $E(\|\hat f - f_\rho\|^2) \lesssim \Big( \frac{n}{\log n} \Big)^{-2s/(1+2s)},$

again with $s := 1/p - 1/2$. In the case where $f_\rho \in \mathcal{L}_1$, we obtain the same
result with $p = 1$ and $s = 1/2$. We also show that this estimator is universally
consistent.
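The truncation operator (1.19) simply clips a value to the interval $[-B, B]$. A minimal sketch, with the hypothetical names `truncate` and `truncate_values`, is:

```python
import math


def truncate(x, B):
    """T_B x = min{B, |x|} sgn(x), as in (1.19)."""
    return math.copysign(min(B, abs(x)), x)


def truncate_values(values, B):
    """Apply T_B to each sampled value of an approximant."""
    return [truncate(v, B) for v in values]


print(truncate_values([3.0, -5.0, 0.5], 2.0))   # [2.0, -2.0, 0.5]
```

Since the data are assumed to satisfy $|y_i| \le B$, truncating an approximant at level $B$ never moves its values further away from the data.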

In order to place these results within the current state of the art of
statistical learning theory, let us first remark that similar convergence rates
for the denoising and the learning problem could be obtained by a more
"brute force" approach involving the selection of a proper subset of $\mathcal{D}$ by
complexity regularization with techniques such as those in [2] or in Chapter
12 of [11]. Following, for instance, the general approach of [11], this would
typically first require restricting the size of the dictionary $\mathcal{D}$ [usually to be
of size $O(n^a)$ for some $a > 1$] and then considering all possible subsets $\Lambda \subset \mathcal{D}$
and spaces $G_\Lambda := \operatorname{Span}\{g \in \Lambda\}$, each of them defining an estimator

(1.21)    $\hat f_\Lambda := T_B \Big( \operatorname{Argmin}_{f \in G_\Lambda} \|y - f\|_n^2 \Big).$

The estimator $\hat f$ is then defined as the $\hat f_\Lambda$ which minimizes

(1.22)    $\min_{\Lambda \subset \mathcal{D}} \{ \|y - \hat f_\Lambda\|_n^2 + \operatorname{Pen}(\Lambda, n) \},$

with $\operatorname{Pen}(\Lambda, n)$ a complexity penalty term. The penalty term usually restricts
the size of $\Lambda$ to be at most $O(n)$, but even then, the search is over $O(n^{an})$
subsets. In some other approaches, the sets $G_\Lambda$ might also be discretized,
transforming the subproblem of selecting $\hat f_\Lambda$ into a discrete optimization
problem.

The main advantage of using the greedy algorithm in place of (1.22) for
constructing the estimator is a dramatic reduction of the computational cost.
Indeed, instead of considering all possible subsets $\Lambda \subset \mathcal{D}$, the algorithm only
considers the sets $\Lambda_k := \{g_1, \dots, g_k\}$, $k = 1, \dots, n$, generated by the empirical
greedy algorithm.

This approach was proposed and analyzed in [18] using a version of the
RGA in which

(1.23)    $\alpha_k + \beta_k = 1,$

which implies that the approximation $f_k$ at each iteration stays in the convex
hull $C_1$ of $\mathcal{D}$. The authors established that if $f$ does not belong to $C_1$, then
the RGA converges to its projection onto $C_1$. In turn, the estimator was
proven to converge in the sense of (1.20) to $f_\rho$, with rate $(n/\log n)^{-1/2}$,
if $f_\rho$ lies in $C_1$ and otherwise to its projection onto $C_1$. In that sense, this
procedure is not universally consistent.

One of the main contributions of the present paper is to remove
requirements of the type $f_\rho \in \mathcal{L}_1$ when obtaining convergence rates. In the learning
context, there is indeed typically no advance information that would
guarantee such restrictions on $f_\rho$. The estimators that we construct for learning
are now universally consistent and have provable convergence rates for more
general regression functions described by means of interpolation spaces. One
of the main ingredients in our analysis of the performance of our greedy
algorithms in learning is a powerful exponential concentration inequality which
was introduced in [18]. Let us mention that a closely related analysis, which
does not, however, involve interpolation spaces, is developed in [5, 13].

The most studied dictionaries in learning theory are in the context of
neural networks. In Section 4, we interpret our results in this setting and, in
particular, describe the smoothness conditions on a function $f$ which ensure
that it belongs to the spaces $\mathcal{L}_1$ or $B_p$.

Let us finally mention that there exist some natural connections between
the greedy algorithms which are discussed in this paper and other numerical
techniques for building a sparse approximation in the dictionary based on
the minimization of an $\ell_1$ penalized criterion. In the statistical context, these
are the celebrated LASSO [12, 23] and LARS [10] algorithms. The relation
between $\ell_1$ minimization and greedy selection is particularly transparent in
the context of deterministic approximation of a function $f$ in an orthonormal
basis: if we consider the problem of minimizing

(1.24)    $\Big\| f - \sum_{g \in \mathcal{D}} d_g g \Big\|^2 + t \sum_{g \in \mathcal{D}} |d_g|$

over all choices of sequences $(d_g)$, we see that it amounts to minimizing
$|c_g - d_g|^2 + t|d_g|$ for each individual $g$, where $c_g := \langle f, g \rangle$. The solution to
this problem is given by the soft thresholding operator

(1.25)    $d_g := c_g - \frac{t}{2} \operatorname{sign}(c_g)$ if $|c_g| > \frac{t}{2}$, and $d_g := 0$ otherwise,

and is therefore very similar to the results of selecting the largest coefficients
of $f$.
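The soft thresholding rule (1.25) acts on each coefficient separately. The following sketch, with the illustrative names `soft_threshold` and `penalized_cost`, implements it together with the scalar criterion $|c_g - d_g|^2 + t|d_g|$ that it minimizes, so that the claim can be checked numerically on a grid of candidate values.

```python
def soft_threshold(c, t):
    """Minimizer over d of (c - d)^2 + t * |d|, as in (1.25)."""
    if abs(c) > t / 2:
        return c - (t / 2) * (1.0 if c > 0 else -1.0)
    return 0.0


def penalized_cost(c, d, t):
    """Scalar version of the criterion (1.24) for one coefficient."""
    return (c - d) ** 2 + t * abs(d)


t = 2.0
for c in [3.0, -3.0, 0.5]:
    d = soft_threshold(c, t)
    # Compare against a brute-force search over a grid of candidates.
    grid = [k / 100.0 for k in range(-500, 501)]
    best = min(penalized_cost(c, x, t) for x in grid)
    print(c, d, penalized_cost(c, d, t) <= best + 1e-12)
```

Coefficients with $|c_g| \le t/2$ are set to zero and the others are shrunk by $t/2$, whereas keeping the largest coefficients as in (1.2) corresponds to hard thresholding without shrinkage.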

2. Approximation properties. Let $\mathcal{D}$ be a dictionary in some Hilbert
space $\mathcal{H}$, with $\|g\| = 1$ for all $g \in \mathcal{D}$. We recall that, for a given $f \in \mathcal{H}$, the
OGA builds embedded approximation spaces

(2.1)    $V_k := \operatorname{Span}\{g_1, \dots, g_k\}, \quad k = 1, 2, \dots,$

and approximations

(2.2)    $f_k := P_k f,$

where $P_k$ is the orthogonal projection onto $V_k$. The rule for generating the
$g_k$ is as follows. We set $V_0 := \{0\}$, $f_0 := 0$ and $r_0 := f$ and, given $V_{k-1}$,
$f_{k-1} = P_{k-1} f$ and $r_{k-1} := f - f_{k-1}$, we define $g_k$ by

(2.3)    $g_k := \operatorname{Argmax}_{g \in \mathcal{D}} |\langle r_{k-1}, g \rangle|,$

which defines the new $V_k$, $f_k$ and $r_k$.

In its most general form, the RGA sets

(2.4)    $f_k = \alpha_k f_{k-1} + \beta_k g_k,$

where $(\alpha_k, \beta_k, g_k)$ are defined according to (1.8). We shall consider a simpler
version in which the first parameter is a fixed sequence. Two choices will be
considered, namely

(2.5)    $\alpha_k = 1 - \frac{1}{k}$

and

(2.6)    $\alpha_k = 1 - \frac{2}{k}$ if $k > 1$, $\quad \alpha_1 = 0.$

The two other parameters are optimized according to

(2.7)    $(\beta_k, g_k) := \operatorname{Argmin}_{(\beta, g) \in \mathbb{R} \times \mathcal{D}} \|f - \alpha_k f_{k-1} - \beta g\|.$

Since

(2.8)    $\|f - \alpha_k f_{k-1} - \beta g\|^2 = \beta^2 - 2\beta \langle f - \alpha_k f_{k-1}, g \rangle + \|f - \alpha_k f_{k-1}\|^2,$

it is readily seen that $(\beta_k, g_k)$ are given explicitly by

(2.9)    $\beta_k = \langle f - \alpha_k f_{k-1}, g_k \rangle$

and

(2.10)    $g_k := \operatorname{Argmax}_{g \in \mathcal{D}} |\langle f - \alpha_k f_{k-1}, g \rangle|.$

Therefore, from a computational point of view, this RGA is very similar to
the PGA.
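The explicit formulas (2.9) and (2.10) make the RGA with the choice (2.5) straightforward to implement. The sketch below assumes $\mathcal{H} = \mathbb{R}^d$ and a finite dictionary of unit vectors; the function `rga` and the example dictionary are illustrative assumptions, not part of the paper.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rga(f, dictionary, steps):
    """Relaxed greedy algorithm with alpha_k = 1 - 1/k, see (2.5), (2.9), (2.10).

    Returns the final approximation and the list of errors ||r_1||, ..., ||r_N||.
    """
    approx = [0.0] * len(f)
    errors = []
    for k in range(1, steps + 1):
        alpha = 1.0 - 1.0 / k
        target = [fi - alpha * ai for fi, ai in zip(f, approx)]
        g = max(dictionary, key=lambda atom: abs(dot(target, atom)))   # (2.10)
        beta = dot(target, g)                                           # (2.9)
        approx = [alpha * ai + beta * gi for ai, gi in zip(approx, g)]  # (2.4)
        residual = [fi - ai for fi, ai in zip(f, approx)]
        errors.append(dot(residual, residual) ** 0.5)
    return approx, errors


# f = 0.5 g_1 + 0.3 g_2 + 0.2 g_3, so the L_1 norm of f is at most 1.
dictionary = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.6, 0.8]]
f = [0.68, 0.36, 0.16]
approx, errors = rga(f, dictionary, 20)
for n in (1, 5, 20):
    print(n, errors[n - 1])
```

Each step costs one pass over the dictionary to evaluate the inner products, exactly as in the PGA; only the relaxation factor $\alpha_k$ differs.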

We denote by $\mathcal{L}_p$ the functions $f$ which admit a converging expansion
$f = \sum c_g g$ with $\sum |c_g|^p < +\infty$ and we write $\|f\|_{\mathcal{L}_p} = \inf \|(c_g)\|_{\ell_p}$, where the
infimum is taken over all such expansions. In a similar way, we consider the
spaces $w\mathcal{L}_p$ corresponding to expansions which are in the weak space $w\ell_p$.
We denote by $\sigma_N(f)$ the best $N$-term approximation error in the $\mathcal{H}$ norm
for $f$ and for any $s > 0$, we define the approximation space

(2.11)    $\mathcal{A}^s := \{ f \in \mathcal{H} : \sigma_N(f) \le M N^{-s},\ N = 1, 2, \dots \}.$

Finally, we denote by $\mathcal{G}^s$ the set of functions $f$ such that the greedy algorithm
under consideration converges with rate $\|r_N\| \lesssim N^{-s}$, so that, obviously,
$\mathcal{G}^s \subset \mathcal{A}^s$.

In the case where $\mathcal{D}$ is an orthonormal basis, the space $\mathcal{A}^s$ contains the
space $\mathcal{L}_p$ with $1/p = 1/2 + s$ and, in fact, actually coincides with the weak
versions $w\mathcal{L}_p$ of these spaces. In those cases, an algorithm for building a
best (or near best) $N$-term approximation is simply to keep the $N$ largest
coefficients of $f$ and discard the others. The best $N$-term approximation is
also obtained by the orthogonal greedy algorithm so that, obviously, $\mathcal{A}^s = \mathcal{G}^s$.
2.1. Approximation of $\mathcal{L}_1$ functions. In this section, we recall, for
convenience, the approximation properties of the OGA and RGA for functions
$f \in \mathcal{L}_1$. We first recall the result obtained in [9] for the OGA. We shall make
use of the following fact: if $f, g \in \mathcal{H}$ with $\|g\| = 1$, then $\langle f, g \rangle g$ is the best
approximation to $f$ from the one-dimensional space generated by $g$ and

(2.12)    $\|f - \langle f, g \rangle g\|^2 = \|f\|^2 - |\langle f, g \rangle|^2.$

Theorem 2.1. For all $f \in \mathcal{L}_1$, the error of the OGA satisfies

(2.13)    $\|r_N\| \le \|f\|_{\mathcal{L}_1} (N + 1)^{-1/2}, \quad N = 1, 2, \dots.$

Proof. Since $f_k$ is the best approximation to $f$ from $V_k$, we have, from
(2.12),

(2.14)    $\|r_k\|^2 \le \|r_{k-1} - \langle r_{k-1}, g_k \rangle g_k\|^2 = \|r_{k-1}\|^2 - |\langle r_{k-1}, g_k \rangle|^2,$

with equality in the case where $g_k$ is orthogonal to $V_{k-1}$. Since $r_{k-1}$ is
orthogonal to $f_{k-1}$, we have

(2.15)    $\|r_{k-1}\|^2 = \langle r_{k-1}, f \rangle \le \|f\|_{\mathcal{L}_1} \sup_{g \in \mathcal{D}} |\langle r_{k-1}, g \rangle| = \|f\|_{\mathcal{L}_1} |\langle r_{k-1}, g_k \rangle|,$

which, combined with (2.14), gives the reduction property

(2.16)    $\|r_k\|^2 \le \|r_{k-1}\|^2 (1 - \|r_{k-1}\|^2 \|f\|_{\mathcal{L}_1}^{-2}).$

We also know that $\|r_1\| \le \|r_0\| = \|f\| \le \|f\|_{\mathcal{L}_1}$.

We then check by induction that a decreasing sequence $(a_n)_{n \ge 0}$ of
nonnegative numbers which satisfy $a_0 \le M$ and $a_k \le a_{k-1}(1 - \frac{a_{k-1}}{M})$ for all $k > 0$ has
the decay property $a_n \le \frac{M}{n+1}$ for all $n \ge 0$. Indeed, assuming that $a_{n-1} \le \frac{M}{n}$
for some $n > 0$, then either $a_{n-1} \le \frac{M}{n+1}$, so that $a_n \le \frac{M}{n+1}$, or else $a_{n-1} \ge \frac{M}{n+1}$,
so that

(2.17)    $a_n \le \frac{M}{n} \Big( 1 - \frac{1}{n+1} \Big) = \frac{M}{n+1}.$

The result follows by applying this to $a_k = \|r_k\|^2$ and $M := \|f\|_{\mathcal{L}_1}^2$, since we
indeed have

(2.18)    $a_0 = \|f\|^2 \le \|f\|_{\mathcal{L}_1}^2.$ □
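The induction step above can be illustrated numerically. The sketch below iterates the recursion with equality, $a_k = a_{k-1}(1 - a_{k-1}/M)$, from several starting values $a_0 \le M$ and compares each term with the bound $M/(n+1)$. The function name `recursion_with_equality` is ours, and the check only covers this one family of sequences; it is an illustration rather than a substitute for the proof.

```python
def recursion_with_equality(a0, M, steps):
    """Iterate a_k = a_{k-1} * (1 - a_{k-1} / M), starting from a_0 = a0."""
    seq = [a0]
    for _ in range(steps):
        a = seq[-1]
        seq.append(a * (1.0 - a / M))
    return seq


M = 1.0
for a0 in [0.1, 0.25, 0.5, 0.75, 1.0]:
    seq = recursion_with_equality(a0, M, 50)
    # Largest value of a_n * (n + 1) / M along the sequence; the lemma
    # states that this ratio never exceeds 1.
    worst_ratio = max(a * (n + 1) / M for n, a in enumerate(seq))
    print(a0, worst_ratio)
```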

We now turn to the RGA, for which we shall prove a slightly stronger 
property. 

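Theorem 2.2. For all $f \in \mathcal{L}_1$, the error of the RGA using (2.5) satisfies

(2.19)    $\|r_N\| \le (\|f\|_{\mathcal{L}_1}^2 - \|f\|^2)^{1/2} N^{-1/2}, \quad N = 1, 2, \dots.$

Proof. From the definition of the RGA, we see that the sequence $f_k$
remains unchanged if the dictionary $\mathcal{D}$ is symmetrized by including the
sequence $(-g)_{g \in \mathcal{D}}$. Under this assumption, since $f \in \mathcal{L}_1$, for any $\varepsilon > 0$, we
can expand $f$ according to

(2.20)    $f = \sum_{g \in \mathcal{D}} b_g g,$

where all of the $b_g$ are nonnegative and satisfy

(2.21)    $\sum_{g \in \mathcal{D}} b_g = \|f\|_{\mathcal{L}_1} + \delta,$

with $0 \le \delta \le \varepsilon$. According to (2.7), and since $f - \alpha_k f_{k-1} = \alpha_k r_{k-1} + \frac{1}{k} f$ for the
choice (2.5), we have, for all $\beta \in \mathbb{R}$ and all $g \in \mathcal{D}$,

$\|r_k\|^2 \le \|f - \alpha_k f_{k-1} - \beta g\|^2$

$= \Big\| \alpha_k r_{k-1} + \frac{1}{k} f - \beta g \Big\|^2$

$= \alpha_k^2 \|r_{k-1}\|^2 + 2\alpha_k \Big\langle r_{k-1}, \frac{1}{k} f - \beta g \Big\rangle + \Big\| \frac{1}{k} f - \beta g \Big\|^2$

$= \alpha_k^2 \|r_{k-1}\|^2 + 2\alpha_k \Big\langle r_{k-1}, \frac{1}{k} f - \beta g \Big\rangle + \frac{1}{k^2} \|f\|^2 + \beta^2 - \frac{2\beta}{k} \langle f, g \rangle.$

This inequality holds for all $g \in \mathcal{D}$, so it also holds on the average with
weights $\frac{b_g}{\sum_{g \in \mathcal{D}} b_g}$, which gives, for the particular value $\beta = \frac{1}{k}(\|f\|_{\mathcal{L}_1} + \delta)$
(for which the average of $\beta g$ equals $\frac{1}{k} f$, so that the cross term vanishes),

(2.22)    $\|r_k\|^2 \le \alpha_k^2 \|r_{k-1}\|^2 - \frac{1}{k^2} \|f\|^2 + \beta^2.$

Therefore, letting $\varepsilon$ tend to 0, we obtain

(2.23)    $\|r_k\|^2 \le \Big( 1 - \frac{1}{k} \Big)^2 \|r_{k-1}\|^2 + \frac{1}{k^2} (\|f\|_{\mathcal{L}_1}^2 - \|f\|^2).$

We now check by induction that a sequence $(a_k)_{k > 0}$ of positive numbers
such that $a_1 \le M$ and $a_k \le (1 - \frac{1}{k})^2 a_{k-1} + \frac{1}{k^2} M$ for all $k > 0$ has the decay
property $a_n \le \frac{M}{n}$ for all $n > 0$. Indeed, assuming that $a_{n-1} \le \frac{M}{n-1}$, we write

$a_n - \frac{M}{n} \le \Big( 1 - \frac{1}{n} \Big)^2 \frac{M}{n-1} + \frac{1}{n^2} M - \frac{M}{n} = M \Big( \frac{n-1}{n^2} + \frac{1}{n^2} - \frac{1}{n} \Big) = 0.$

The result follows by applying this to $a_k = \|r_k\|^2$ and $M := \|f\|_{\mathcal{L}_1}^2 - \|f\|^2$,
since (2.23) also implies that $a_1 \le M$. □

The above results show that for both OGA and RGA, we have

(2.24)    $\mathcal{L}_1 \subset \mathcal{G}^{1/2} \subset \mathcal{A}^{1/2}.$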



From this, it also follows that $w\mathcal{L}_p \subset \mathcal{A}^s$ with $s = 1/p - 1/2$ when $p < 1$.
Indeed, from the definition of $w\mathcal{L}_p$, any function $f$ in this space can be written
as $f = \sum_{j=1}^{\infty} c_j g_j$, with each $g_j \in \mathcal{D}$ and the coefficients $c_j$ decreasing in
absolute value and satisfying $|c_j| \le M j^{-1/p}$, $j \ge 1$, with $M := \|f\|_{w\mathcal{L}_p}$.
Therefore, $f = f_a + f_b$ with $f_a := \sum_{j=1}^{N} c_j g_j$ and $\|f_b\|_{\mathcal{L}_1} \le C_p M N^{1 - 1/p}$. It follows
from Theorem 2.1, for example, that $f_b$ can be approximated by an $N$-term
expansion obtained by the greedy algorithm with accuracy $\|f_b - P_N f_b\| \le
C_p M N^{1/2 - 1/p} = C_p M N^{-s}$ and therefore, by taking $f_a + P_N f_b$ as a $2N$-term
approximant of $f$, we obtain that $f \in \mathcal{A}^s$. Observe, however, that this does
not mean that $f \in \mathcal{G}^s$ in the sense that we have not proven that the greedy
algorithm converges with rate $N^{-s}$ when applied to $f$. It is actually shown
in [9] that there exist simple dictionaries such that the greedy algorithm
does not converge faster than $N^{-1/2}$, even for functions $f$ which are in $w\mathcal{L}_p$
for all $p > 0$.

2.2. Approximation of general functions. We now want to study the
behavior of the OGA and RGA when the function $f \in \mathcal{H}$ is more general, in the
sense that it is less sparse than being in $\mathcal{L}_1$. The simplest way of expressing
this would seem to be by considering the spaces $\mathcal{L}_p$ or $w\mathcal{L}_p$ with $1 < p < 2$.
However, for general dictionaries, these spaces are not well adapted, since
$\|f\|_{\mathcal{L}_p}$ does not control the Hilbert norm $\|f\|$.

Instead, we shall consider the real interpolation spaces

(2.25)    $B_p = [\mathcal{H}, \mathcal{L}_1]_{\theta, \infty}, \quad 0 < \theta < 1,$

again with $p$ defined by $1/p = \theta + (1 - \theta)/2 = (1 + \theta)/2$. Recall that $f \in
[X, Y]_{\theta, \infty}$ if and only if for all $t > 0$, we have

(2.26)    $K(f, t) \le C t^{\theta},$

where

(2.27)    $K(f, t) := K(f, t, X, Y) := \inf_{h \in Y} \{ \|f - h\|_X + t \|h\|_Y \}$

is the so-called $K$-functional. In other words, $f$ can be decomposed into
$f = f_X + f_Y$, with

(2.28)    $\|f_X\|_X + t \|f_Y\|_Y \le C t^{\theta}.$

The smallest $C$ such that the above holds defines a norm for $Z = [X, Y]_{\theta, \infty}$.
We refer to [7] or [6] for an introduction to interpolation spaces. The space
$B_p$ coincides with $w\mathcal{L}_p$ in the case where $\mathcal{D}$ is an orthonormal system, but
may differ from it for a more general dictionary.

The main result of this section is that for both the OGA and the RGA,

(2.29)    $\|r_N\| \le C K(f, N^{-1/2}, \mathcal{H}, \mathcal{L}_1), \quad N = 1, 2, \dots,$

so that, according to (2.26), $f \in B_p$ implies the rate of decay $\|r_N\| \lesssim N^{-\theta/2}$.
Note that if $f_N$ were obtained as the action on $f$ of a continuous linear
operator $L_N$ from $\mathcal{H}$ onto itself such that $\|L_N\| \le C$ with $C$ independent of
$N$, then we could write, for any $h \in \mathcal{L}_1$,

(2.30)    $\|f - f_N\| \le \|(I - L_N)[f - h]\| + \|h - L_N h\| \lesssim \|f - h\| + \|h\|_{\mathcal{L}_1} N^{-1/2},$

so that (2.29) would follow by minimizing over $h \in \mathcal{L}_1$. However, $f_N$ is
obtained by a highly nonlinear algorithm and it is therefore quite remarkable
that (2.29) still holds. We first prove this for the OGA.

Theorem 2.3. For all $f \in \mathcal{H}$ and any $h \in \mathcal{L}_1$, the error of the OGA
satisfies

(2.31)    $\|r_N\|^2 \le \|f - h\|^2 + 4 \|h\|_{\mathcal{L}_1}^2 N^{-1}, \quad N = 1, 2, \dots,$

and therefore

(2.32)    $\|r_N\| \le K(f, 2N^{-1/2}, \mathcal{H}, \mathcal{L}_1) \le 2 K(f, N^{-1/2}, \mathcal{H}, \mathcal{L}_1), \quad N = 1, 2, \dots.$

Proof. Fix an arbitrary $f \in \mathcal{H}$. For any $h \in \mathcal{L}_1$, we write

(2.33)    $\|r_{k-1}\|^2 = \langle r_{k-1}, h + f - h \rangle \le \|h\|_{\mathcal{L}_1} |\langle r_{k-1}, g_k \rangle| + \|r_{k-1}\| \|f - h\|,$

from which it follows that

(2.34)    $\|r_{k-1}\|^2 \le \|h\|_{\mathcal{L}_1} |\langle r_{k-1}, g_k \rangle| + \frac{1}{2} (\|r_{k-1}\|^2 + \|f - h\|^2).$

Therefore, using the shorthand notation $a_k := \|r_k\|^2 - \|f - h\|^2$, we have

(2.35)    $|\langle r_{k-1}, g_k \rangle| \ge \frac{a_{k-1}}{2 \|h\|_{\mathcal{L}_1}}.$

Note that if for some $k_0$, we have $\|r_{k_0 - 1}\| \le \|f - h\|$, then the theorem holds
trivially for all $N \ge k_0 - 1$. We therefore assume that $a_{k-1}$ is positive, so
that we can write

(2.36)    $|\langle r_{k-1}, g_k \rangle|^2 \ge \frac{a_{k-1}^2}{4 \|h\|_{\mathcal{L}_1}^2}.$

From (2.14), we therefore obtain

(2.37)    $\|r_k\|^2 \le \|r_{k-1}\|^2 - \frac{a_{k-1}^2}{4 \|h\|_{\mathcal{L}_1}^2},$

which, by subtracting $\|f - h\|^2$, gives

(2.38)    $a_k \le a_{k-1} \Big( 1 - \frac{a_{k-1}}{4 \|h\|_{\mathcal{L}_1}^2} \Big).$

As in the proof of Theorem 2.1, we can conclude that $a_N \le 4 \|h\|_{\mathcal{L}_1}^2 N^{-1}$,
provided that we initially have $a_1 \le 4 \|h\|_{\mathcal{L}_1}^2$. In order to check this initial
condition, we remark that either $a_0 \le 4 \|h\|_{\mathcal{L}_1}^2$, so that the same holds for $a_1$,
or $a_0 \ge 4 \|h\|_{\mathcal{L}_1}^2$, in which case $a_1 \le 0$ according to (2.38), which means that
we are already in the trivial case $\|r_1\| \le \|f - h\|$ for which there is nothing
to prove. We have therefore obtained (2.31) and (2.32) follows by taking the
square root. □

We next treat the case of the RGA, for which we have a slightly different
result. In this result, we use the second choice (2.6) for the sequence $\alpha_k$ in
order to obtain a multiplicative constant equal to 1 in the term $\|f - h\|^2$
appearing on the right side of the quadratic bound. This will be important
in the learning application. We also give the nonquadratic bound with the
first choice (2.5) since it gives a slightly better result than by taking the
square root of the quadratic bound based on (2.6).

Theorem 2.4. For all f and any h £ C±, the error of the RGA 
using (2.6) satisfies 

(2.39) \\r N \\ 2 < ||/ - hf + 4(^11^ - \\hf)N-\ N = 1, 2, . . . , 
and therefore 

\\r N \\<K(f,2N-V 2 ,H,£i) 

(2.40) 

<2K(f,N- 1 / 2 ,H,£ 1 ), N = 1,2, — 
Using the first choice (2.5), the error satisfies 

(2.41) \\r N \\ < ||/ - h\\ + (11*11^ - ||/ l || 2 ) 1 /2 iV -V2 ) AT = 1,2, ... , 
and therefore 

(2.42) WrNW^Kif^^/^HXi), N = l,2,.... 

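Proof. Fix $f \in \mathcal{H}$ and let $h \in \mathcal{L}_1$ be arbitrary. Similarly to the proof of Theorem 2.2, for any $\varepsilon > 0$, we can expand $h$ as

(2.43) $h = \sum_{g \in \mathcal{D}} b_g g,$

where all of the $b_g$ are nonnegative and satisfy

(2.44) $\sum_{g \in \mathcal{D}} b_g = \|h\|_{\mathcal{L}_1} + \delta,$

with $0 \le \delta \le \varepsilon$. Using the notation $\bar\alpha_k = 1 - \alpha_k$, we have, for all $\beta \in \mathbb{R}$ and all $g \in \mathcal{D}$,

$\|r_k\|^2 \le \|f - \alpha_k f_{k-1} - \beta g\|^2$

$= \|\alpha_k r_{k-1} + \bar\alpha_k f - \beta g\|^2$

$= \alpha_k^2 \|r_{k-1}\|^2 + 2\alpha_k \langle r_{k-1}, \bar\alpha_k f - \beta g\rangle + \|\bar\alpha_k f - \beta g\|^2$

$= \alpha_k^2 \|r_{k-1}\|^2 + 2\alpha_k \langle r_{k-1}, \bar\alpha_k f - \beta g\rangle + \|\bar\alpha_k (f - h) + \bar\alpha_k h - \beta g\|^2$

$= \alpha_k^2 \|r_{k-1}\|^2 + 2\alpha_k \langle r_{k-1}, \bar\alpha_k f - \beta g\rangle + \bar\alpha_k^2 \|f - h\|^2 + 2\bar\alpha_k \langle f - h, \bar\alpha_k h - \beta g\rangle + \bar\alpha_k^2 \|h\|^2 - 2\beta\bar\alpha_k \langle h, g\rangle + \beta^2.$

This inequality holds for all $g \in \mathcal{D}$, so it also holds on the average with weights $b_g / \sum_{g \in \mathcal{D}} b_g$, which gives, for the particular value $\beta = \bar\alpha_k(\|h\|_{\mathcal{L}_1} + \delta)$,

$\|r_k\|^2 \le \alpha_k^2 \|r_{k-1}\|^2 + 2\alpha_k\bar\alpha_k \langle r_{k-1}, f - h\rangle + \bar\alpha_k^2 \|f - h\|^2 - \bar\alpha_k^2 \|h\|^2 + \beta^2$

$= \|\alpha_k r_{k-1} + \bar\alpha_k (f - h)\|^2 - \bar\alpha_k^2 \|h\|^2 + \beta^2.$

Letting $\varepsilon$ tend to 0 and using the notation $M := \|h\|_{\mathcal{L}_1}^2 - \|h\|^2$, we thus obtain

(2.45) $\|r_k\|^2 \le (\alpha_k \|r_{k-1}\| + \bar\alpha_k \|f - h\|)^2 + \bar\alpha_k^2 M.$

Note that this immediately implies the validity of (2.39) and (2.41) at $N = 1$, using the fact that $\alpha_1 = 0$ for both choices (2.5) and (2.6). We next proceed by induction, assuming that these bounds hold at $k - 1$.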
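For the proof of (2.41), where $\alpha_k = 1 - 1/k$ and $\bar\alpha_k = 1/k$, we derive from (2.45) that

$\|r_k\|^2 \le \left(\alpha_k\left(\|f - h\| + \left(\frac{M}{k-1}\right)^{1/2}\right) + \bar\alpha_k \|f - h\|\right)^2 + \bar\alpha_k^2 M$

$= \left(\|f - h\| + \alpha_k\left(\frac{M}{k-1}\right)^{1/2}\right)^2 + \bar\alpha_k^2 M$

$= \|f - h\|^2 + 2M^{1/2}\|f - h\| \frac{1 - 1/k}{(k-1)^{1/2}} + M\left(\frac{(1 - 1/k)^2}{k-1} + \frac{1}{k^2}\right)$

$= \|f - h\|^2 + 2M^{1/2}\|f - h\| \frac{(k-1)^{1/2}}{k} + \frac{M}{k}$

$\le \left(\|f - h\| + \frac{M^{1/2}}{k^{1/2}}\right)^2,$

which is the desired bound at $k$.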

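For the proof of (2.39), we derive from (2.45) that

(2.46) $\|r_k\|^2 \le \alpha_k^2 \|r_{k-1}\|^2 + 2\alpha_k\bar\alpha_k \|r_{k-1}\| \|f - h\| + \bar\alpha_k^2 \|f - h\|^2 + \bar\alpha_k^2 M.$

Noting that $2\alpha_k\bar\alpha_k \|r_{k-1}\| \|f - h\| \le \alpha_k\bar\alpha_k(\|r_{k-1}\|^2 + \|f - h\|^2)$ and that $\alpha_k + \bar\alpha_k = 1$, we obtain

(2.47) $\|r_k\|^2 \le \alpha_k \|r_{k-1}\|^2 + \bar\alpha_k \|f - h\|^2 + \bar\alpha_k^2 M$

and therefore

(2.48) $\|r_k\|^2 - \|f - h\|^2 \le \alpha_k(\|r_{k-1}\|^2 - \|f - h\|^2) + \bar\alpha_k^2 M.$

Assuming that (2.39) holds at $N = k - 1$, and recalling that $\alpha_k = 1 - 2/(k+1)$ for the choice (2.6), we thus obtain

(2.49) $\|r_k\|^2 - \|f - h\|^2 \le \left(1 - \frac{2}{k+1}\right)\frac{4M}{k-1} + \frac{4M}{(k+1)^2} = \frac{4M}{k+1} + \frac{4M}{(k+1)^2} \le \frac{4M}{k}.$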

We have thus obtained (2.39), and (2.40) follows by taking the square root. 
□ 
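The bounds of Theorem 2.4 can be checked numerically on a small example. The following is a minimal sketch, assuming a finite dictionary of unit vectors in Euclidean space; the function names `rga_residual_norms` and `make_example` are hypothetical and not part of the paper. Since the $\ell_1$ norm of any particular coefficient vector is an upper bound for $\|h\|_{\mathcal{L}_1}$, the bounds (2.39) and (2.41) remain valid when $\|h\|_{\mathcal{L}_1}$ is replaced by that sum.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def rga_residual_norms(f, dictionary, n_steps, choice="2.6"):
    """Run the relaxed greedy algorithm and return [||r_1||, ..., ||r_N||].

    Assumption of this sketch: `dictionary` is a list of unit-norm vectors.
    """
    fk = [0.0] * len(f)
    norms = []
    for k in range(1, n_steps + 1):
        # alpha_k is fixed in advance: (2.5) is 1 - 1/k, (2.6) is 1 - 2/(k+1).
        alpha = 1.0 - 2.0 / (k + 1) if choice == "2.6" else 1.0 - 1.0 / k
        target = [fi - alpha * x for fi, x in zip(f, fk)]
        # For a unit vector g, ||target - beta g|| is minimized at
        # beta = <target, g>, so the best g maximizes |<target, g>|.
        g = max(dictionary, key=lambda d: abs(dot(target, d)))
        beta = dot(target, g)
        fk = [alpha * x + beta * gi for x, gi in zip(fk, g)]
        norms.append(norm([fi - x for fi, x in zip(f, fk)]))
    return norms


def make_example(seed=0, dim=8, size=30):
    """Random unit dictionary, a target f, a comparison element h, and an
    upper bound for the L_1 norm of h (the l_1 norm of its coefficients)."""
    rng = random.Random(seed)
    dictionary = []
    for _ in range(size):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        s = norm(v)
        dictionary.append([x / s for x in v])
    coeffs = [rng.uniform(-1.0, 1.0) for _ in range(size)]
    h = [sum(c * g[i] for c, g in zip(coeffs, dictionary)) for i in range(dim)]
    f = [hi + 0.1 * rng.gauss(0.0, 1.0) for hi in h]
    l1_bound = sum(abs(c) for c in coeffs)
    return f, h, dictionary, l1_bound
```

In this sketch the parameter $\alpha_k$ is prescribed and only $(\beta, g)$ is optimized at each step, which is the variant of the RGA analyzed in the proof above.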

Remark 2.5. More generally, we can consider the RGA with the choice 

(2.50) a k = (l - - 

where $A$ is some fixed parameter. If $A > 1$, an argument similar to that used in the proof of (2.39) gives the general estimate

(2.51) $\|r_N\|^2 \le \|f - h\|^2 + C(\|h\|_{\mathcal{L}_1}^2 - \|h\|^2)N^{-1},$

with C = ^rf- A specific feature of this estimate is that the term ||/ — h\\ 2 
has a multiplicative constant equal to 1, which will be of critical importance 
in the learning context. We do not know if such an estimate can be obtained 
if 0< A< 1. 

An immediate consequence of Theorems 2.3 and 2.4 combined with the definition of the $\mathcal{B}_p$ spaces [see (2.26)] is a rate of convergence of the OGA and RGA for the functions in $\mathcal{B}_p$.

Corollary 2.6. For all $f \in \mathcal{B}_p$, the approximation errors for both the OGA and RGA satisfy the decay bound

(2.52) $\|r_N\| \lesssim \|f\|_{\mathcal{B}_p} N^{-s},$

with $s = 1/p - 1/2$. Therefore, we have $\mathcal{B}_p \subset \mathcal{G}^s \subset \mathcal{A}^s$.

In addition, when $\mathcal{D}$ is a complete family in $\mathcal{H}$, we know that $\mathcal{L}_1$ is dense in $\mathcal{H}$, so

(2.53) $\lim_{t \to 0} K(f, t, \mathcal{H}, \mathcal{L}_1) = 0$

for any $f \in \mathcal{H}$. This implies the following corollary.

Corollary 2.7. For any $f \in \mathcal{H}$, the approximation error $\|r_N\|$ tends to zero as $N \to +\infty$ for both the OGA and RGA.
2.3. Greedy algorithms with a truncated dictionary. In concrete applications, it is not possible to evaluate the supremum of $|\langle r_{k-1}, g\rangle|$ over the whole dictionary $\mathcal{D}$, but only over a finite subset of it. For applications in learning theory, it will also be useful that the size of this subset has at most polynomial growth in the number of samples $n$. We therefore introduce a fixed exhaustion of $\mathcal{D}$,

(2.54) $\mathcal{D}_1 \subset \mathcal{D}_2 \subset \cdots \subset \mathcal{D},$

with $\#(\mathcal{D}_m) = m$. The analysis we present in this section is similar to that in [22]. We are now interested in the functions $f$ which can be approximated to a certain accuracy by application of the OGA only using the elements of $\mathcal{D}_m$. For this purpose, we first introduce the space $\mathcal{L}_1(\mathcal{D}_m)$ of those functions in $\mathrm{Span}(\mathcal{D}_m)$ equipped with the (minimal) $\ell_1$ norm of the coefficients. We next define, for $r > 0$, the space $\mathcal{L}_{1,r}$ to be the set of all functions $f$ such that, for all $m$, there exists $h$ (depending on $m$) such that

(2.55) $\|h\|_{\mathcal{L}_1(\mathcal{D}_m)} \le C$

and

(2.56) $\|f - h\| \le Cm^{-r}.$

The smallest constant $C$ such that this holds defines a norm for $\mathcal{L}_{1,r}$. In order to understand how these spaces are related to the space $\mathcal{L}_1$ for the whole dictionary, consider the example where $\mathcal{D}$ is a Schauder basis and consider the decomposition of $f$ into

(2.57) $f = \sum_{g \in \mathcal{D}_m} c_g g + \sum_{g \notin \mathcal{D}_m} c_g g = h + f - h.$

It is then obvious that $\|h\|_{\mathcal{L}_1(\mathcal{D}_m)} \le \|f\|_{\mathcal{L}_1}$. Therefore, a sufficient condition for $f$ to be in $\mathcal{L}_{1,r}$ is that $f \in \mathcal{L}_1$ and that its tail $\|\sum_{g \notin \mathcal{D}_m} c_g g\|$ decays like $m^{-r}$.

Application of Theorems 2.3 and 2.4 shows us that if we apply the OGA or RGA with the restricted dictionary and if the target function $f$ is in $\mathcal{L}_{1,r}$, then we have

(2.58) $\|r_k\| \le C_0 \|f\|_{\mathcal{L}_{1,r}} (k^{-1/2} + m^{-r}),$

where $C_0$ is an absolute constant [$C_0 = 2$ for OGA and $C_0 = 1$ for RGA with choice (2.5)].

In a similar manner, we can introduce the interpolation space

(2.59) $\mathcal{B}_{p,r} := [\mathcal{H}, \mathcal{L}_{1,r}]_{\theta,\infty},$

again with $1/p = (1 + \theta)/2$. From the definition of interpolation spaces, if $f \in \mathcal{B}_{p,r}$, then for all $t > 0$, there exists $\tilde f \in \mathcal{L}_{1,r}$ such that

(2.60) $\|\tilde f\|_{\mathcal{L}_{1,r}} \le \|f\|_{\mathcal{B}_{p,r}} t^{\theta - 1}$

and

(2.61) $\|f - \tilde f\| \le \|f\|_{\mathcal{B}_{p,r}} t^{\theta}.$

We also know that for all $m$, there exists $h$ (depending on $m$) such that

(2.62) $\|h\|_{\mathcal{L}_1(\mathcal{D}_m)} \le \|\tilde f\|_{\mathcal{L}_{1,r}} \le \|f\|_{\mathcal{B}_{p,r}} t^{\theta - 1}$

and

(2.63) $\|\tilde f - h\| \le \|\tilde f\|_{\mathcal{L}_{1,r}} m^{-r},$

so that, by the triangle inequality,

(2.64) $\|f - h\| \le \|f\|_{\mathcal{B}_{p,r}} (t^{\theta} + t^{\theta - 1} m^{-r}).$

Application of Theorems 2.3 and 2.4 shows us that if we apply the OGA or RGA with the restricted dictionary and if the target function $f$ is in $\mathcal{B}_{p,r}$, then we have, for any $t > 0$,

(2.65) $\|r_k\| \le C_0 \|f\|_{\mathcal{B}_{p,r}} (t^{\theta - 1} k^{-1/2} + t^{\theta} + t^{\theta - 1} m^{-r}).$

In particular, taking $t = k^{-1/2}$ and noting that $\theta = 2s$ gives

(2.66) $\|r_k\| \le C_0 \|f\|_{\mathcal{B}_{p,r}} (k^{-s} + k^{1/2 - s} m^{-r}).$

We therefore recover the rate of Corollary 2.6 up to an additive perturbation which tends to 0 as $m \to +\infty$.

Let us conclude this section by making some remarks on the spaces $\mathcal{B}_{p,r}$. These spaces should be viewed as being slightly smaller than the spaces $\mathcal{B}_p$. The smaller the value of $r > 0$, the smaller the distinction between $\mathcal{B}_p$ and $\mathcal{B}_{p,r}$. Also, note that the classes $\mathcal{B}_{p,r}$ depend very much on how we exhaust the dictionary $\mathcal{D}$. For example, if $\mathcal{D} = \mathcal{B}_0 \cup \mathcal{B}_1$ is the union of two bases $\mathcal{B}_0$ and $\mathcal{B}_1$, then exhausting the elements of $\mathcal{B}_0$ faster than those of $\mathcal{B}_1$ will result in different classes than if we exhaust those of $\mathcal{B}_1$ faster than those of $\mathcal{B}_0$. However, in concrete settings, there is usually a natural order in which to exhaust the dictionary.

3. Application to statistical learning. 

3.1. Notation and definition of the estimator. We consider the classical bounded regression problem. We observe $n$ independent realizations $(z_i) = (x_i, y_i)$, $i = 1, \ldots, n$, of an unknown distribution $\rho$ on $Z = X \times Y$. We assume here that the output variable satisfies almost surely

(3.1) $|y| \le B,$

where the bound $B$ is known to us. We denote by

(3.2) $f_\rho(x) = E(y|x)$

the regression function which minimizes the quadratic risk

(3.3) $R(f) := E(|f(x) - y|^2)$

over all functions $f$. For any $f$, we have

(3.4) $R(f) - R(f_\rho) = \|f - f_\rho\|^2,$

where we use the notation

(3.5) $\|u\|^2 := E(|u(x)|^2) = \|u\|^2_{L_2(\rho_X)},$

with $\rho_X$ being the marginal probability measure defined on $X$. We are therefore interested in constructing from our data an estimator $\hat f$ such that $\|\hat f - f_\rho\|^2$ is small. Since $\hat f$ depends on the realization of the training sample $z := (z_i) \in Z^n$, we shall measure the estimation error by the expectation $E(\|\hat f - f_\rho\|^2)$ taken with respect to $\rho^n$.

Given our training sample $z$, we define the empirical norm

(3.6) $\|f\|_n^2 := \frac{1}{n} \sum_{i=1}^{n} |f(x_i)|^2.$

Note that $\|\cdot\|_n$ is the $L_2$ norm with respect to the discrete measure $\nu_x := \frac{1}{n}\sum_{i=1}^{n} \delta_{x_i}$, with $\delta_u$ the Dirac measure at $u$. As such, the norm depends on $x := (x_1, \ldots, x_n)$ and not just on $n$, but we adopt the notation (3.6) to conform with other major works in learning. We view the vector $y := (y_1, \ldots, y_n)$ as a function $y$ defined on the design $x := (x_1, \ldots, x_n)$ with $y(x_i) = y_i$. Then, for any $f$ defined on $x$,

(3.7) $\|y - f\|_n^2 := \frac{1}{n} \sum_{i=1}^{n} |y_i - f(x_i)|^2$

is the empirical risk for $f$.

In order to estimate $f_\rho$ from the given data $z$, we shall use the greedy algorithms OGA and RGA described in the previous section. We choose an arbitrary value of $a \ge 1$ and then fix it. We consider a dictionary $\mathcal{D}$ and truncations of this dictionary $\mathcal{D}_1, \mathcal{D}_2, \ldots$, as described in Section 2.3. We will use approximation from the span of the dictionary $\mathcal{D}_m$ in our algorithm, where we assume that the size $m$ is limited by

(3.8) $m \le m(n) := \lfloor n^a \rfloor.$


Our estimator is defined as follows: 
(i) Given a data set $z$ of size $n$, we apply the OGA, SPA or the RGA for the dictionary $\mathcal{D}_m$ to the function $y$ using the empirical inner product associated with the norm $\|\cdot\|_n$. In the case of the RGA, we use the second choice (2.6) for the parameter $\alpha_k$. This gives a sequence of functions $(\hat f_k)_{k=0}^{\infty}$ defined on $x$.

(ii) We define the estimator $\hat f := T\hat f_{k^*}$, where $Tu := T_B u := \min\{B, |u|\}\,\mathrm{sgn}(u)$ is the truncation operator at level $B$ and $k^*$ is chosen to minimize (over all $k > 0$) the penalized empirical risk

(3.9) $\|y - T\hat f_k\|_n^2 + \kappa \frac{k \log n}{n},$

with $\kappa > 0$ a constant to be fixed later.
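Step (ii) can be sketched in a few lines. This is a minimal illustration, assuming the greedy approximants have already been evaluated on the design points; the names `truncate`, `empirical_risk` and `select_k` are hypothetical, and the value of `kappa` is left to the caller since the paper fixes it only in Theorem 3.1.

```python
import math


def truncate(u, B):
    """Truncation operator T_B: clip u to the interval [-B, B]."""
    return math.copysign(min(B, abs(u)), u)


def empirical_risk(y, values):
    """Empirical risk (1/n) * sum_i |y_i - values_i|^2, as in (3.7)."""
    return sum((yi - vi) ** 2 for yi, vi in zip(y, values)) / len(y)


def select_k(y, fits, B, kappa):
    """Return (k_star, truncated values of the selected approximant).

    Assumption of this sketch: `fits[k-1]` holds the values of the k-th
    greedy approximant at the design points x_1, ..., x_n.
    """
    n = len(y)
    best = None
    for k, fk in enumerate(fits, start=1):
        tfk = [truncate(v, B) for v in fk]
        # Penalized empirical risk (3.9).
        score = empirical_risk(y, tfk) + kappa * k * math.log(n) / n
        if best is None or score < best[0]:
            best = (score, k, tfk)
    return best[1], best[2]
```

A larger `kappa` favors smaller values of `k`, which is the trade-off quantified by the term $Ck\log n/n$ in Theorem 3.1.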

We shall make some remarks about this algorithm. First, note that for $k = 0$, the penalized risk is bounded by $B^2$ since $\hat f_0 = 0$ and $|y| \le B$. This means that we need not run the greedy algorithm for values of $k$ larger than $Bn/\kappa$. Second, our notation $\hat f$ suppresses the dependence of the estimator on $z$, which is again customary notation in statistics. The application of the $k$th step of the greedy algorithms requires the evaluation of $O(n^a)$ inner products. In the case of the OGA, we also need to compute the projection of $y$ onto a $k$-dimensional space. This could be done by Gram-Schmidt orthogonalization. Assuming that we had already computed an orthonormal system for step $k - 1$, this would require an additional evaluation of $k - 1$ inner products and then a normalization step. Finally, the truncation of the dictionary $\mathcal{D}$ is not strictly needed in some more specific cases such as neural networks (see Section 4).
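The incremental Gram-Schmidt update described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: dictionary elements are given by their values on the design points, they are normalized in the empirical norm before selection, and the names `inner_n` and `oga` are hypothetical.

```python
import math


def inner_n(u, v):
    """Empirical inner product (1/n) * sum_i u_i v_i."""
    return sum(a * b for a, b in zip(u, v)) / len(u)


def oga(y, dictionary, n_steps, tol=1e-12):
    """Orthogonal greedy algorithm on the design points.

    Returns the approximants f_1, ..., f_K (as value vectors) and the
    indices of the selected dictionary elements.
    """
    # Normalize each element in the empirical norm (assumes nonzero elements).
    unit = [[x / math.sqrt(inner_n(g, g)) for x in g] for g in dictionary]
    residual = list(y)
    basis, chosen, approximants = [], [], []
    for _ in range(n_steps):
        # Greedy selection: one inner product per dictionary element.
        j = max(range(len(unit)), key=lambda i: abs(inner_n(residual, unit[i])))
        if abs(inner_n(residual, unit[j])) <= tol:
            break  # the residual is orthogonal to the whole dictionary
        # Gram-Schmidt: remove components along the stored orthonormal vectors,
        # then normalize.
        q = list(unit[j])
        for b in basis:
            c = inner_n(q, b)
            q = [qi - c * bi for qi, bi in zip(q, b)]
        q_norm = math.sqrt(inner_n(q, q))
        if q_norm <= tol:
            break
        q = [qi / q_norm for qi in q]
        basis.append(q)
        chosen.append(j)
        # The residual is already orthogonal to the previous basis vectors,
        # so projecting out q yields the new orthogonal projection residual.
        c = inner_n(residual, q)
        residual = [ri - c * qi for ri, qi in zip(residual, q)]
        approximants.append([yi - ri for yi, ri in zip(y, residual)])
    return approximants, chosen
```

Keeping the orthonormal system from one step to the next is what limits the extra cost of step $k$ to $k - 1$ inner products plus a normalization.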

In the following, we want to analyze the performance of our algorithm. For this analysis, we need to assume something about $f_\rho$. To impose conditions on $f_\rho$, we shall also view the elements of the dictionary normalized in the $L_2(\rho_X)$ norm $\|\cdot\|$. With this normalization, we denote by $\mathcal{L}_1$, $\mathcal{B}_p$, $\mathcal{L}_{1,r}$ and $\mathcal{B}_{p,r}$ the spaces of functions that have been previously introduced for a general Hilbert space $\mathcal{H}$. Here, we have $\mathcal{H} = L_2(\rho_X)$.

Finally, we denote by $\mathcal{L}_1^n$ the space of functions which admit an $\ell_1$ expansion in the dictionary when the elements are normalized in the empirical norm $\|\cdot\|_n$. This space is again equipped with a norm defined as the smallest $\ell_1$ norm among all admissible expansions. Similarly to $\|\cdot\|_n$, this norm depends on the realization of the design $x$.

3.2. Error analysis. In this section, we establish our main result, which 
will allow us, in the next section, to analyze the performance of the estimator 
under various smoothness assumptions on f p . 

Theorem 3.1. There exists $\kappa_0$ depending only on $B$ and $a$ such that if $\kappa \ge \kappa_0$, then for all $k > 0$ and for all functions $h$ in $\mathrm{Span}(\mathcal{D}_m)$, the estimator satisfies

(3.10) $E(\|\hat f - f_\rho\|^2) \le 8\frac{\|h\|^2_{\mathcal{L}_1(\mathcal{D}_m)}}{k} + 2\|f_\rho - h\|^2 + C\frac{k \log n}{n},$

where the constant $C$ only depends on $\kappa$, $B$ and $a$.

The proof of Theorem 3.1 relies on a few preliminary results that we collect below. The first is a direct application of Theorem 3 from [18] or Theorem 11.4 from [11].

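Lemma 3.2. Let $\mathcal{F}$ be a class of functions which are all bounded by $B$. For all $n$ and $\alpha, \beta > 0$, we have

(3.11) $\Pr\{\exists f \in \mathcal{F} : \|f - f_\rho\|^2 \ge 2(\|y - f\|_n^2 - \|y - f_\rho\|_n^2) + \alpha + \beta\} \le 14 \sup_{x} \mathcal{N}\left(\frac{\beta}{40B}, \mathcal{F}, L_1(\nu_x)\right) \exp\left(-\frac{\alpha n}{2568 B^4}\right),$

where $x = (x_1, \ldots, x_n) \in X^n$ and $\mathcal{N}(t, \mathcal{F}, L_1(\nu_x))$ is the covering number for the class $\mathcal{F}$ by balls of radius $t$ in $L_1(\nu_x)$, with $\nu_x := \frac{1}{n}\sum_{i=1}^{n} \delta_{x_i}$ the empirical discrete measure.

Proof. This follows from Theorem 11.4 of [11] by taking $\varepsilon = 1/2$ in that theorem. □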

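We shall apply this result in the following setting. Given any set $\Lambda \subset \mathcal{D}$, we define $\mathcal{G}_\Lambda := \mathrm{span}\{g : g \in \Lambda\}$ and denote by $T\mathcal{G}_\Lambda := \{Tf : f \in \mathcal{G}_\Lambda\}$ the set of all truncations of the elements of $\mathcal{G}_\Lambda$, where $T = T_B$ as before. We then define

(3.12) $\mathcal{F}_k := \bigcup_{\Lambda \subset \mathcal{D}_m,\ \#(\Lambda) \le k} T\mathcal{G}_\Lambda.$

The following result gives an upper bound for the entropy numbers $\mathcal{N}(t, \mathcal{F}_k, L_1(\nu))$.

Lemma 3.3. For any probability measure $\nu$, for any $t > 0$ and for any $\Lambda$ with cardinality $k$, we have the bound

(3.13) $\mathcal{N}(t, T\mathcal{G}_\Lambda, L_1(\nu)) \le 3\left(\frac{2eB}{t}\log\frac{3eB}{t}\right)^{k+1}.$

Additionally,

(3.14) $\mathcal{N}(t, \mathcal{F}_k, L_1(\nu)) \le 3n^{ak}\left(\frac{2eB}{t}\log\frac{3eB}{t}\right)^{k+1}.$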
Proof. For each $\Lambda$ with cardinality $k$, Theorem 9.4 in [11] gives

(3.15) $\mathcal{N}(t, T\mathcal{G}_\Lambda, L_1(\nu)) \le 3\left(\frac{2eB}{t}\log\frac{3eB}{t}\right)^{V_\Lambda},$

with $V_\Lambda$ the VC dimension of the set of all subgraphs of $T\mathcal{G}_\Lambda$. It is easily seen that $V_\Lambda$ is smaller than the VC dimension of the set of all subgraphs of $\mathcal{G}_\Lambda$, which, by Theorem 9.5 in [11], is less than $k + 1$. This establishes (3.13). Since there are fewer than $n^{ak}$ possible sets $\Lambda$, the result (3.14) follows by taking the union of the coverings for all $T\mathcal{G}_\Lambda$ as a covering for $\mathcal{F}_k$. □

Finally, we will need a result that relates the $\mathcal{L}_1$ and $\mathcal{L}_1^n$ norms.



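Lemma 3.4. Given any dictionary $\mathcal{D}$, for all functions $h$ in $\mathrm{Span}(\mathcal{D})$, we have

(3.16) $E(\|h\|^2_{\mathcal{L}_1^n}) \le \|h\|^2_{\mathcal{L}_1}.$

Proof. We normalize the elements of the dictionary in $\|\cdot\| = \|\cdot\|_{L_2(\rho_X)}$. Given any $h = \sum_{g \in \mathcal{D}} c_g g$ and any $z$ of length $n$, we write

(3.17) $h = \sum_{g \in \mathcal{D}} c_g g = \sum_{g \in \mathcal{D}} c_g^n \frac{g}{\|g\|_n},$

with $c_g^n := c_g \|g\|_n$. We then observe that

$E\left(\left(\sum_{g \in \mathcal{D}} |c_g^n|\right)^2\right) = \sum_{(g,g') \in \mathcal{D} \times \mathcal{D}} |c_g c_{g'}| E(\|g\|_n \|g'\|_n)$

$\le \sum_{(g,g') \in \mathcal{D} \times \mathcal{D}} |c_g c_{g'}| (E(\|g\|_n^2) E(\|g'\|_n^2))^{1/2}$

$= \sum_{(g,g') \in \mathcal{D} \times \mathcal{D}} |c_g c_{g'}| (\|g\|^2 \|g'\|^2)^{1/2}$

$= \sum_{(g,g') \in \mathcal{D} \times \mathcal{D}} |c_g c_{g'}|$

$= \left(\sum_{g \in \mathcal{D}} |c_g|\right)^2.$

The result follows by taking the infimum over all possible admissible $(c_g)$ and using the fact that

(3.18) $E\left(\inf \left(\sum_{g \in \mathcal{D}} |c_g^n|\right)^2\right) \le \inf E\left(\left(\sum_{g \in \mathcal{D}} |c_g^n|\right)^2\right).$

□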
Proof of Theorem 3.1. We write

(3.19) $\|\hat f - f_\rho\|^2 = T_1 + T_2,$

where

(3.20) $T_1 := \|\hat f - f_\rho\|^2 - 2\left(\|y - \hat f\|_n^2 - \|y - f_\rho\|_n^2 + \kappa\frac{k^* \log n}{n}\right)$

and

(3.21) $T_2 := 2\left(\|y - \hat f\|_n^2 - \|y - f_\rho\|_n^2 + \kappa\frac{k^* \log n}{n}\right).$

From the definition of our estimator, we know that for all $k > 0$, we have

(3.22) $T_2 \le 2\left(\|y - \hat f_k\|_n^2 - \|y - f_\rho\|_n^2 + \kappa\frac{k \log n}{n}\right).$

Therefore, for all $k > 0$ and $h \in L_2(\rho_X)$, we have $T_2 \le T_3 + T_4$ with

(3.23) $T_3 := 2(\|y - \hat f_k\|_n^2 - \|y - h\|_n^2)$

and

(3.24) $T_4 := 2(\|y - h\|_n^2 - \|y - f_\rho\|_n^2) + 2\kappa\frac{k \log n}{n}.$

We now bound the expectations of $T_1$, $T_3$ and $T_4$. For the last term, we have

(3.25) $E(T_4) = 2E(|y - h(x)|^2 - |y - f_\rho(x)|^2) + 2\kappa\frac{k \log n}{n} = 2\|f_\rho - h\|^2 + 2\kappa\frac{k \log n}{n}.$

For $T_3$, we know from Theorems 2.3 and 2.4 that we have

(3.26) $\|y - \hat f_k\|_n^2 - \|y - h\|_n^2 \le 4\frac{\|h\|^2_{\mathcal{L}_1^n}}{k}.$

Using, in addition, Lemma 3.4, we thus obtain

(3.27) $E(T_3) \le 8\frac{\|h\|^2_{\mathcal{L}_1(\mathcal{D}_m)}}{k}.$

For $T_1$, we introduce $\Omega$, the set of $z \in Z^n$ for which

(3.28) $\|\hat f - f_\rho\|^2 \ge 2\left(\|y - \hat f\|_n^2 - \|y - f_\rho\|_n^2 + \kappa\frac{k^* \log n}{n}\right).$

Since $T_1 \le \|\hat f - f_\rho\|^2 + 2\|y - f_\rho\|_n^2 \le 6B^2$, we have

(3.29) $E(T_1) \le 6B^2 \Pr(\Omega).$
We thus obtain that for all $k > 0$ and for all $h \in L_2(\rho_X)$, we have

(3.30) $E(\|\hat f - f_\rho\|^2) \le 8\frac{\|h\|^2_{\mathcal{L}_1(\mathcal{D}_m)}}{k} + 2\|f_\rho - h\|^2 + 2\kappa\frac{k \log n}{n} + 6B^2 \Pr(\Omega).$

It remains to bound $\Pr(\Omega)$. Since $k^*$ can take an arbitrary value depending on the sample realization, we simply control this quantity by the union bound

(3.31) $\sum_{1 \le k \le Bn/\kappa} \Pr\left\{\exists f \in \mathcal{F}_k : \|f - f_\rho\|^2 \ge 2(\|y - f\|_n^2 - \|y - f_\rho\|_n^2) + 2\kappa\frac{k \log n}{n}\right\}.$

Denoting by $p_k$ each term of this sum, we obtain, by Lemma 3.2, that

(3.32) $p_k \le 14 \sup_{x} \mathcal{N}\left(\frac{\beta}{40B}, \mathcal{F}_k, L_1(\nu_x)\right) \exp\left(-\frac{\alpha n}{2568 B^4}\right),$

provided $\alpha + \beta \le 2\kappa\frac{k \log n}{n}$. Assuming without loss of generality that $\kappa > 1$, we can take $\alpha := \kappa\frac{k \log n}{n}$ and $\beta = 1/n$, from which it follows that

(3.33) $p_k \le 14 \sup_{x} \mathcal{N}\left(\frac{1}{40Bn}, \mathcal{F}_k, L_1(\nu_x)\right) n^{-\kappa k/(2568 B^4)}.$

Using Lemma 3.3, we finally obtain

(3.34) $p_k \le C n^{ak} n^{2(k+1)} n^{-\kappa k/(2568 B^4)},$

so that by choosing $\kappa \ge \kappa_0 = 2568 B^4 (a + 5)$, we always have $p_k \le Cn^{-2}$. It follows that

(3.35) $\Pr(\Omega) \le \sum_{k \le Bn/\kappa} p_k \le \frac{C}{n}.$

This contribution is therefore absorbed into the term $2\kappa\frac{k \log n}{n}$ in the main bound and this completes our proof. □

Remark 3.5. The value $\kappa_0 = 2568 B^4 (a + 5)$ is a pessimistic estimate due to the large numerical constant. In practice, this may result in selecting too small a value for $k^*$. An alternative approach to the complexity penalty for choosing $k^*$ is the so-called "hold out" method, which, in the present case, would consist in (i) splitting the sample set into two independent subsets $\{1, \ldots, \tilde n\}$ and $\{\tilde n + 1, \ldots, n\}$, (ii) using the first subset to build the sequence $(\hat f_k)_{k \ge 0}$ and (iii) using the second subset to select a proper value of $k^*$.
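Step (iii) of the hold-out method amounts to minimizing the empirical risk on the held-out subset. A minimal sketch follows, assuming that the approximants built from the first subset have been evaluated at the design points of the second subset; the function name `hold_out_select` and the indexing from $k = 0$ are illustrative choices, not part of the paper.

```python
def hold_out_select(y_val, fits_val):
    """Return the index k minimizing the risk on the held-out subset.

    Assumption of this sketch: `fits_val[k]` holds the values of the k-th
    approximant (built from the first subset only) at the held-out design
    points, and `y_val` holds the corresponding outputs.
    """
    def risk(values):
        return sum((yi - vi) ** 2 for yi, vi in zip(y_val, values)) / len(y_val)

    return min(range(len(fits_val)), key=lambda k: risk(fits_val[k]))
```

No penalty term appears here: the independence of the two subsets plays the role that the complexity penalty plays in (3.9).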
3.3. Rates of convergence and universal consistency. In this section, we apply Theorem 3.1 in several situations which correspond to different prior assumptions on $f_\rho$. In order to control the approximation error resulting from using the truncated dictionary $\mathcal{D}_m$ in our algorithm, we take an arbitrary but fixed $a > 0$ and take the size $m$ exactly of the order of $n^a$,

(3.36) $m = m(n) := \lfloor n^a \rfloor.$

We first consider the case where $f_\rho \in \mathcal{L}_{1,r}$. In that case, we know that for all $m$, there exists $h \in \mathrm{Span}(\mathcal{D}_m)$ such that $\|h\|_{\mathcal{L}_1(\mathcal{D}_m)} \le M$ and $\|f_\rho - h\| \le Mm^{-r}$, with $M := \|f_\rho\|_{\mathcal{L}_{1,r}}$. Therefore, Theorem 3.1 yields

(3.37) $E(\|\hat f - f_\rho\|^2) \le C \min_{k > 0}\left(\frac{M^2}{k} + M^2 n^{-2ar} + \frac{k \log n}{n}\right).$

In order that truncating the dictionary does not affect the estimation bound, we make the assumption that $2ar \ge 1$. This allows us to delete the middle term in (3.37). Note that this is not a strong additional restriction over $f_\rho \in \mathcal{L}_1$ since $a$ can be fixed arbitrarily large.



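Corollary 3.6. If $f_\rho \in \mathcal{L}_{1,r}$ with $r > 1/(2a)$, then

(3.38) $E(\|\hat f - f_\rho\|^2) \le C(1 + \|f_\rho\|_{\mathcal{L}_{1,r}})\left(\frac{n}{\log n}\right)^{-1/2}.$

Proof. We take $k := \left\lceil \left((M + 1)^2 \frac{n}{\log n}\right)^{1/2} \right\rceil$ in (3.37) and obtain the desired result. □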

We next consider the case where $f_\rho \in \mathcal{B}_{p,r}$. In that case, we know that for all $m$ and for all $t > 0$, there exists $h \in \mathrm{Span}(\mathcal{D}_m)$ such that $\|h\|_{\mathcal{L}_1(\mathcal{D}_m)} \le Mt^{\theta - 1}$ [see (2.62)] and $\|f_\rho - h\| \le M(t^{\theta} + t^{\theta - 1} m^{-r})$ [see (2.64)], with $1/p = (1 + \theta)/2$ and with $M = \|f_\rho\|_{\mathcal{B}_{p,r}}$. Taking $t = k^{-1/2}$ and applying Theorem 3.1, we obtain

(3.39) $E(\|\hat f - f_\rho\|^2) \le C \min_{k > 0}\left(M^2 k^{-2s} + M^2 (k^{-s} + k^{-s + 1/2} n^{-ar})^2 + \frac{k \log n}{n}\right),$

with $s = 1/p - 1/2$. We now impose that $ar \ge 1/2$, which allows us to delete the term involving $n^{-ar}$.

Corollary 3.7. If $f_\rho \in \mathcal{B}_{p,r}$ with $r > 1/(2a)$, then

(3.40) $E(\|\hat f - f_\rho\|^2) \le C(1 + \|f_\rho\|_{\mathcal{B}_{p,r}})^{2/(2s+1)}\left(\frac{n}{\log n}\right)^{-2s/(1+2s)},$

with $C$ a constant depending only on $\kappa$, $B$ and $a$.

Proof. We take $k := \left\lceil \left((M + 1)^2 \frac{n}{\log n}\right)^{1/(1+2s)} \right\rceil$ in (3.39) and obtain the desired result. □

Let us finally prove that the estimator is universally consistent. For this, we need to assume that the dictionary $\mathcal{D}$ is complete in $L_2(\rho_X)$.

Theorem 3.8. For an arbitrary regression function, we have

(3.41) $\lim_{n \to +\infty} E(\|\hat f - f_\rho\|^2) = 0.$

Proof. For any $\varepsilon > 0$ and $n$ sufficiently large, there exists $h \in \mathrm{Span}(\mathcal{D}_m)$ which satisfies $\|f_\rho - h\| \le \varepsilon$. According to Theorem 3.1, we thus have

(3.42) $E(\|\hat f - f_\rho\|^2) \le C \min_{k > 0}\left(\frac{\|h\|^2_{\mathcal{L}_1(\mathcal{D}_m)}}{k} + \varepsilon^2 + \frac{k \log n}{n}\right).$

Taking $k = n^{1/2}$, this gives

(3.43) $E(\|\hat f - f_\rho\|^2) \le C(\varepsilon^2 + n^{-1/2} \log n),$

which is smaller than $2C\varepsilon^2$ for sufficiently large $n$. □

Finally, we remark that although our results for the learning problem are 
stated and proved for the OGA and RGA, they hold equally well when the 
SPA is employed. 

4. Neural networks. Neural networks have been one of the main motivations for the use of greedy algorithms in statistics [2, 4, 14, 18]. They correspond to a particular type of dictionary. One begins with a univariate positive and increasing function $\sigma$ that satisfies $\sigma(-\infty) = 0$ and $\sigma(+\infty) = 1$ and defines the dictionary consisting of all functions

(4.1) $x \mapsto \sigma(\langle v, x\rangle + w)$

for all vectors $v \in \mathbb{R}^D$ and scalars $w \in \mathbb{R}$, where $D$ is the dimension of the feature variable $x$. Typical choices for $\sigma$ are the Heaviside function $h = \chi_{x > 0}$ or more general sigmoidal functions which are regularized versions of $h$.
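A dictionary element of the form (4.1) is easy to write down explicitly. The following sketch builds such elements as Python closures; the names `heaviside`, `logistic` and `ridge_unit` are hypothetical, and the logistic function is only one possible example of a sigmoidal regularization of the Heaviside function.

```python
import math


def heaviside(t):
    """Heaviside function: 1 for t > 0, else 0."""
    return 1.0 if t > 0 else 0.0


def logistic(t):
    """A smooth sigmoidal function with limits 0 at -inf and 1 at +inf."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)  # avoids overflow for very negative t
    return e / (1.0 + e)


def ridge_unit(v, w, sigma=heaviside):
    """Return the dictionary element x -> sigma(<v, x> + w) of (4.1)."""
    def g(x):
        return sigma(sum(vi * xi for vi, xi in zip(v, x)) + w)
    return g
```

For instance, `ridge_unit([1.0, 0.0], -0.5)` is the indicator of the half-space $\{x : x_1 > 0.5\}$ in dimension $D = 2$.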

In [18], the authors consider neural networks where $\sigma$ is the Heaviside function and the vectors $v$ are restricted to have at most $d$ nonzero coordinates ($d$-bounded fan-in) for some fixed $d \le D$. We denote this dictionary by $\mathcal{D}$. With this choice of dictionary and using the standard relaxed greedy algorithm, they establish the convergence rate



(4.2) 
where $f_a$ is the projection of $f_\rho$ onto the convex hull of the dictionary $\mathcal{D}$. This can also be expressed by

(4.3) $E(\|\hat f_k - f_a\|^2) \lesssim \frac{1}{k} + \frac{kd \log n}{n},$

which shows that with the choice $k = n^{1/2}$, the estimator converges to $f_a$ with rate $n^{-1/2}$ up to a logarithmic factor. In particular, the algorithm is not universally consistent since it only converges to the regression function when it belongs to this convex hull.

4.1. Convergence results. Let us apply our results to this particular setting. We first want to note that in this case, it is, from a theoretical point of view, not necessary to truncate the dictionary $\mathcal{D}$ into a finite dictionary in order to achieve our theoretical results. The truncation of dictionaries was used in the proof of Theorem 3.1 to bound the covering numbers of the sets $\mathcal{F}_k$ through the bound established in Lemma 3.3. In the specific case of $\mathcal{D}$, one can bound these covering numbers without truncation. Let us note that in this case,

(4.4) $\mathcal{F}_k := \bigcup_{\Lambda \subset \mathcal{D},\ \#(\Lambda) \le k} T\mathcal{G}_\Lambda,$

where we no longer have the restriction that $\Lambda$ is in $\mathcal{D}_m$.

Lemma 4.1. For the dictionary $\mathcal{D}$, any probability measure $\nu$ of the type $\nu = \frac{1}{n}\sum_{i=1}^{n} \delta_{x_i}$ and any $k > 0$ and $t > 0$, we have the bound

(4.5) $\mathcal{N}(t, \mathcal{F}_k, L_1(\nu)) \le 3(n + 1)^{k(D+1)}\left(\frac{2eB}{t}\log\frac{3eB}{t}\right)^{k+1},$

where the sets $\mathcal{F}_k$ are defined as in (4.4).

Proof. As in the proof of Lemma 3.3, we first remark that

(4.6) $\mathcal{N}(t, T\mathcal{G}_\Lambda, L_1(\nu)) \le 3\left(\frac{2eB}{t}\log\frac{3eB}{t}\right)^{k+1}.$

We next use two facts from Vapnik-Chervonenkis theory (see, e.g., Chapter 9 in [11]): (i) if $\mathcal{A}$ is a collection of sets with VC dimension $\lambda$, then there are at most $(n + 1)^{\lambda}$ sets of $\mathcal{A}$ separating the points $(x_1, \ldots, x_n)$ in different ways; (ii) the collection of half-hyperplanes in $\mathbb{R}^D$ has VC dimension $D + 1$. It follows that there are at most $(n + 1)^{D+1}$ hyperplanes separating the points $(x_1, \ldots, x_n)$ in different ways and therefore there are at most $(n + 1)^{k(D+1)}$ ways of selecting $(g_1, \ldots, g_k)$ in $\mathcal{D}$ which will give different functions on the sample $(x_1, \ldots, x_n)$. The result follows by taking the union of the coverings on all possible $k$-dimensional subspaces. □






Remark 4.2. Under the $d$-bounded fan-in assumption, the factor $(n + 1)^{k(D+1)}$ can be reduced to $D^{kd}\left(\frac{en}{d+1}\right)^{k(d+1)}$; see the proof of Lemma 3 in [18].

Based on this bound, a brief inspection of the proof of Theorem 3.1 shows that its conclusion still holds, now with $\kappa_0$ depending on $B$ and $D$. It follows that the rates of convergence in Corollaries 3.6 and 3.7 now hold under the sole assumptions that $f \in \mathcal{L}_1$ and $f \in \mathcal{B}_p$, respectively. On the other hand, the universal consistency result in Theorem 3.8 requires that the dictionary is complete in $L_2(\rho_X)$, which only holds when $d = D$, that is, when all direction vectors are considered.

Theorem 3.1 improves the bound (4.2) of [18] in two ways: on the one hand, $f_a$ is replaced by an arbitrary function $h$ which can be optimized and on the other hand, the value of $k$ can also be optimized.

Remark 4.3. Note that truncating the dictionary is still necessary to obtain a practical scheme. Such a truncation can be achieved by restricting to a finite number $m$ of direction vectors $v$, which typically grows together with sample size $n$. In this case, we need to consider the spaces $\mathcal{L}_{1,r}$ and $\mathcal{B}_{p,r}$, which contain an additional smoothness assumption compared to $\mathcal{L}_1$ and $\mathcal{B}_p$. This additional amount of smoothness is meant to control the error resulting from the discretization of the direction vectors. We refer to [20] for general results on this problem.

4.2. Smoothness conditions. Finally, let us briefly discuss the meaning of the conditions $f \in \mathcal{L}_1$ and $f \in \mathcal{B}_p$ in the case of a dictionary $\mathcal{D}$ consisting of the functions (4.1) for a fixed $\sigma$ and for all $v \in \mathbb{R}^D$ and $w \in \mathbb{R}$. The following can be deduced from a classical result obtained in [4]: assuming that the marginal distribution $\rho_X$ is supported in a ball $B_r := \{|x| \le r\}$, for any function $f$ having a converging Fourier representation

(4.7) $f(x) = \int \mathcal{F}f(\omega) e^{i\langle \omega, x\rangle}\, d\omega,$

the smoothness condition

(4.8) $\int |\omega| |\mathcal{F}f(\omega)|\, d\omega < +\infty$

ensures that

(4.9) $\|f\|_{\mathcal{L}_1} \le (2rC_f + |f(0)|) \le 2rC_f + \|f\|_{L_\infty},$

with $C_f := \int |\omega| |\mathcal{F}f(\omega)|\, d\omega$. Barron actually proves that condition (4.8) ensures that $f(x) - f(0)$ lies in the closure of the convex hull of $\mathcal{D}$ multiplied by $2rC_f$, the closure being taken in $L_2(\rho_X)$ and the elements of the dictionary being normalized in $L_\infty$. The bound (4.9) follows by remarking that the $L_\infty$ norm controls the $L_2(\rho_X)$ norm.

We can therefore obtain smoothness conditions which ensure that $f \in \mathcal{B}_p$ by interpolating between the condition $\omega \mathcal{F}f(\omega) \in L_1$ and $f \in L_2(\rho_X)$.

In the particular case where $\rho_X$ is continuous with respect to the Lebesgue measure, that is, $\rho_X(A) \le c|A|$, we know that a sufficient condition to have $f \in L_2(\rho_X)$ is given by $\mathcal{F}f \in L_2$.

We can then rewrite the two conditions that we want to interpolate as $|\omega|^{-1}|\mathcal{F}f(\omega)| \in L_1(|\omega|^2\, d\omega)$ and $|\omega|^{-1}|\mathcal{F}f(\omega)| \in L_2(|\omega|^2\, d\omega)$. Therefore, by standard interpolation arguments, we obtain that a sufficient condition for a bounded function $f$ to be in $\mathcal{B}_p$ is given by

(4.10) $|\omega|^{-1}|\mathcal{F}f(\omega)| \in wL_p(|\omega|^2\, d\omega)$

or, in other words,

(4.11) $\sup_{\eta > 0} \eta^p \int \chi_{\{|\mathcal{F}f(\omega)| \ge \eta|\omega|\}} |\omega|^2\, d\omega < +\infty.$

A slightly stronger, but simpler, condition is

(4.12) $|\omega|^{-1}|\mathcal{F}f(\omega)| \in L_p(|\omega|^2\, d\omega),$

which reads

(4.13) $\int |\omega|^{2-p} |\mathcal{F}f(\omega)|^p\, d\omega < +\infty.$

When $\rho_X$ is arbitrary, a sufficient condition for $f \in L_2(\rho_X)$ is $\mathcal{F}f \in L_1$, which actually also ensures that $f \in L_\infty$. In that case, we can again apply standard interpolation arguments and obtain that a sufficient condition for a bounded function $f$ to be in $\mathcal{B}_p$ is given by

(4.14) $\sup_{A > 0} A^{2/p - 1} \int_{|\omega| > A} |\mathcal{F}f(\omega)|\, d\omega < +\infty.$

A slightly stronger, but simpler, condition is

(4.15) $\int |\omega|^{2/p - 1} |\mathcal{F}f(\omega)|\, d\omega < +\infty.$